dataset creator
- North America > Canada > Ontario > Toronto (0.14)
- North America > United States > New York > New York County > New York City (0.04)
- Asia > Japan > Honshū > Kantō > Kanagawa Prefecture > Yokohama (0.04)
- (12 more...)
- Workflow (0.67)
- Overview (0.67)
- Research Report > New Finding (0.45)
- Information Technology (1.00)
- Health & Medicine (1.00)
- Energy (1.00)
- (3 more...)
- Law (1.00)
- Government (0.68)
- Information Technology > Security & Privacy (0.47)
Building Better Datasets: Seven Recommendations for Responsible Design from Dataset Creators
The increasing demand for high-quality datasets in machine learning has raised concerns about the ethical and responsible creation of these datasets. Dataset creators play a crucial role in developing responsible practices, yet their perspectives and expertise have not yet been highlighted in the current literature. In this paper, we bridge this gap by presenting insights from a qualitative study that included interviewing 18 leading dataset creators about the current state of the field. We shed light on the challenges and considerations faced by dataset creators, and our findings underscore the potential for deeper collaboration, knowledge sharing, and collective development. Through a close analysis of their perspectives, we share seven central recommendations for improving responsible dataset creation, including issues such as data quality, documentation, privacy and consent, and how to mitigate potential harms from unintended use cases. By fostering critical reflection and sharing the experiences of dataset creators, we aim to promote responsible dataset creation practices and develop a nuanced understanding of this crucial but often undervalued aspect of machine learning research.
- North America > United States > California (0.28)
- North America > United States > New York > New York County > New York City (0.04)
- Oceania > Australia (0.04)
- (7 more...)
Technical Debt In Machine Learning System - A Model Driven Perspective - DataScienceCentral.com
This article is part 2 of a two-part series on Technical Debt in Machine Learning Systems development. It introduces a simple yet powerful Model of Technical Debt for Machine Learning Systems. The model is simple to remember, easy to extend, and provides a means of building reliable and maintainable Machine Learning Systems. This, in a nutshell, is the value proposition of this post. The Model has four dimensions: Time, Input, System, and Output.
Understanding Machine Learning Practitioners' Data Documentation Perceptions, Needs, Challenges, and Desiderata
Heger, Amy K., Marquis, Liz B., Vorvoreanu, Mihaela, Wallach, Hanna, Vaughan, Jennifer Wortman
Data is central to the development and evaluation of machine learning (ML) models. However, the use of problematic or inappropriate datasets can result in harms when the resulting models are deployed. To encourage responsible AI practice through more deliberate reflection on datasets and transparency around the processes by which they are created, researchers and practitioners have begun to advocate for increased data documentation and have proposed several data documentation frameworks. However, there is little research on whether these data documentation frameworks meet the needs of ML practitioners, who both create and consume datasets. To address this gap, we set out to understand ML practitioners' data documentation perceptions, needs, challenges, and desiderata, with the goal of deriving design requirements that can inform future data documentation frameworks. We conducted a series of semi-structured interviews with 14 ML practitioners at a single large, international technology company. We had them answer a list of questions taken from datasheets for datasets (Gebru, 2021). Our findings show that current approaches to data documentation are largely ad hoc and myopic in nature. Participants expressed needs for data documentation frameworks to be adaptable to their contexts, integrated into their existing tools and workflows, and automated wherever possible. Despite the fact that data documentation frameworks are often motivated from the perspective of responsible AI, participants did not make the connection between the questions that they were asked to answer and their responsible AI implications. In addition, participants often had difficulties prioritizing the needs of dataset consumers and providing information that someone unfamiliar with their datasets might need to know. Based on these findings, we derive seven design requirements for future data documentation frameworks.
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > Wisconsin > Milwaukee County > Milwaukee (0.04)
- North America > United States > Washington > King County > Redmond (0.04)
- (5 more...)
- Research Report > New Finding (1.00)
- Questionnaire & Opinion Survey (1.00)
- Law (1.00)
- Information Technology (1.00)
- Health & Medicine (1.00)
ShortcutLens: A Visual Analytics Approach for Exploring Shortcuts in Natural Language Understanding Dataset
Jin, Zhihua, Wang, Xingbo, Cheng, Furui, Sun, Chunhui, Liu, Qun, Qu, Huamin
Benchmark datasets play an important role in evaluating Natural Language Understanding (NLU) models. However, shortcuts -- unwanted biases in the benchmark datasets -- can damage the effectiveness of benchmark datasets in revealing models' real capabilities. Since shortcuts vary in coverage, productivity, and semantic meaning, it is challenging for NLU experts to systematically understand and avoid them when creating benchmark datasets. In this paper, we develop a visual analytics system, ShortcutLens, to help NLU experts explore shortcuts in NLU benchmark datasets. The system allows users to conduct multi-level exploration of shortcuts. Specifically, Statistics View helps users grasp the statistics such as coverage and productivity of shortcuts in the benchmark dataset. Template View employs hierarchical and interpretable templates to summarize different types of shortcuts. Instance View allows users to check the corresponding instances covered by the shortcuts. We conduct case studies and expert interviews to evaluate the effectiveness and usability of the system. The results demonstrate that ShortcutLens supports users in gaining a better understanding of benchmark dataset issues through shortcuts, inspiring them to create challenging and pertinent benchmark datasets.
- Asia > China > Hong Kong (0.04)
- North America > United States > New York > Suffolk County > Stony Brook (0.04)
- Europe > Ireland (0.04)
- (3 more...)
Datasheets for Datasets
Data plays a critical role in machine learning. Every machine learning model is trained and evaluated using data, quite often in the form of static datasets. The characteristics of these datasets fundamentally influence a model's behavior: a model is unlikely to perform well in the wild if its deployment context does not match its training or evaluation datasets, or if these datasets reflect unwanted societal biases. Mismatches like this can have especially severe consequences when machine learning models are used in high-stakes domains, such as criminal justice [1, 13, 24], hiring [19], critical infrastructure [11, 21], and finance [18]. Even in other domains, mismatches may lead to loss of revenue or public relations setbacks.
- North America > United States > Washington > King County > Seattle (0.14)
- North America > United States > Maryland > Prince George's County > College Park (0.14)
- North America > United States > New York > New York County > New York City (0.05)
- (5 more...)
- Law (1.00)
- Information Technology > Security & Privacy (0.69)